Context Vectors Are Reflections of Word Vectors in Half the Dimensions
https://arxiv.org/pdf/1902.09859.pdf
This paper takes a step towards a theoretical analysis of the relationship between word embeddings and context embeddings in models such as word2vec. We start from basic probabilistic assumptions on the nature of word vectors, context vectors, and text generation. These assumptions are well supported, either empirically or theoretically, by the existing literature. Next, we show that under these assumptions the widely used word-word PMI matrix is approximately a random symmetric Gaussian ensemble. This, in turn, implies that context vectors are reflections of word vectors in approximately half the dimensions. As a direct application of our result, we suggest a theoretically grounded way of tying weights in the SGNS model.
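A minimal sketch of the kind of weight tying this result suggests is given below, assuming the reflection is realized as a fixed diagonal ±1 matrix with half of its entries negative; the shapes, scales, and the choice of which coordinates are flipped are illustrative, not the paper's implementation.

    # Minimal sketch, assuming the result is used to tie SGNS context vectors
    # to word vectors through a fixed reflection in half the dimensions:
    # c_w = D w_w, with D diagonal and roughly half of its entries equal to -1.
    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, dim = 10_000, 100

    W = rng.normal(scale=1.0 / np.sqrt(dim), size=(vocab_size, dim))  # word vectors

    signs = np.ones(dim)
    signs[: dim // 2] = -1.0          # flip (roughly) half of the coordinates
    D = np.diag(signs)

    C = W @ D                         # tied context vectors: no separate matrix learned

    # The SGNS score of a (word, context) pair then becomes <w_i, D w_j>.
    print(W[3] @ D @ W[7])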
Syllable-aware Neural Language Models: A Failure to Beat Character-aware Ones
Syllabification does not seem to improve word-level RNN language modeling quality when compared to character-based segmentation. However, our best syllable-aware language model, achieving performance comparable to the competitive character-aware model, has 18%-33% fewer parameters and is trained 1.2-2.2 times faster. Comment: EMNLP 2017
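As a rough picture of what syllable-level versus character-level segmentation means for such models, here is a generic sub-word composition sketch; the segmentation, embedding sizes, and mean-pooling composer are illustrative assumptions, not the architectures compared in the paper.

    # Hedged sketch (generic, not the paper's models): composing a word vector
    # from sub-word units, contrasting character-level and syllable-level
    # segmentation of the same word.
    import numpy as np

    rng = np.random.default_rng(0)
    dim = 32

    char_units = list("parallel")           # 8 character units
    syl_units = ["par", "al", "lel"]        # 3 hypothetical syllable units

    char_table = {u: rng.normal(size=dim) for u in set(char_units)}
    syl_table = {u: rng.normal(size=dim) for u in syl_units}

    # Mean pooling over unit embeddings; CNN/LSTM composers are common
    # alternatives in sub-word-aware language models. The syllable sequence
    # is shorter, which is one plausible source of the reported speed-up.
    char_word_vec = np.mean([char_table[u] for u in char_units], axis=0)
    syl_word_vec = np.mean([syl_table[u] for u in syl_units], axis=0)
    print(char_word_vec.shape, syl_word_vec.shape)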
Experiments with Russian to Kazakh sentence alignment
Sentence alignment is the final step in building parallel corpora, and it arguably has the greatest impact on the quality of the resulting corpus and on the accuracy of machine translation systems that use it for training. However, the quality of sentence alignment itself depends on a number of factors. In this paper we investigate the impact of several data processing techniques on the quality of sentence alignment. We develop and use a number of automatic evaluation metrics, and provide empirical evidence that applying all of the considered data processing techniques yields bitexts with the lowest ratio of noise and the highest ratio of parallel sentences.
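The evaluation metrics themselves are not spelled out in the abstract; as a hedged stand-in, the sketch below computes one common, very simple noise proxy for a bitext, a character-length-ratio filter. The thresholds and the example Russian-Kazakh pairs are illustrative assumptions, not the paper's settings.

    # Hedged sketch: a crude automatic noise proxy for a bitext based on the
    # source/target character-length ratio of each aligned sentence pair.
    def noise_ratio(bitext, low=0.5, high=2.0):
        """Fraction of sentence pairs whose length ratio falls outside a
        plausible band (a rough stand-in for misaligned pairs)."""
        def implausible(src, tgt):
            if not src or not tgt:
                return True
            ratio = len(src) / len(tgt)
            return ratio < low or ratio > high
        flagged = sum(implausible(s, t) for s, t in bitext)
        return flagged / max(len(bitext), 1)

    pairs = [
        ("Пример предложения на русском языке.", "Орыс тіліндегі сөйлем мысалы."),
        ("Очень длинное русское предложение с большим количеством слов.", "Қысқа."),
    ]
    print(noise_ratio(pairs))   # the second pair is flagged as likely noise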
Gradient Descent Fails to Learn High-frequency Functions and Modular Arithmetic
Classes of target functions containing a large number of approximately orthogonal elements are known to be hard to learn by Statistical Query algorithms. Recently this classical fact re-emerged in the theory of gradient-based optimization of neural networks. In this framework, the hardness of a class is usually quantified by the variance of the gradient with respect to a random choice of a target function.
A set of functions of the form $x \mapsto ax \bmod p$, where $a$ is taken from $\mathbb{Z}_p$, has attracted some attention from deep learning theorists and cryptographers recently. This class can be understood as a subset of $p$-periodic functions on $\mathbb{Z}$ and is tightly connected with a class of high-frequency periodic functions on the real line.
We present a mathematical analysis of the limitations and challenges associated with using gradient-based learning techniques to learn a high-frequency periodic function or modular multiplication from examples. We highlight that the variance of the gradient is negligibly small in both cases when either the frequency or the prime base is large. This, in turn, prevents such a learning algorithm from being successful.
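A small numerical illustration of the near-orthogonality effect behind this hardness argument is sketched below; it is not taken from the paper. For targets $f_a(x) = ax \bmod p$, it estimates the variance over $a$ of the target-dependent part of a population-gradient coordinate, $\mathbb{E}_x[f_a(x)\,\varphi(x)]$ for a fixed probe $\varphi$; this variance shrinks roughly like $1/p$ as the prime grows. The probe, label rescaling, and subsampling of $a$ are illustrative assumptions.

    # Hedged illustration (not from the paper): the a-dependent part of a
    # population-gradient coordinate is E_x[f_a(x) * phi(x)] for a fixed probe
    # phi (playing the role of one coordinate of grad h_theta). Because the
    # targets f_a(x) = a*x mod p are nearly orthogonal, the variance of this
    # quantity over a random choice of a shrinks as p grows.
    import numpy as np

    def grad_component_variance(p, rng, n_targets=2000):
        x = np.arange(p)
        phi = rng.normal(size=p)
        phi -= phi.mean()                      # centered probe
        targets = rng.choice(np.arange(1, p), size=min(p - 1, n_targets), replace=False)
        comps = []
        for a in targets:
            y = ((a * x) % p) / p - 0.5        # centered, rescaled labels of f_a
            comps.append(np.mean(y * phi))     # E_x[f_a(x) * phi(x)]
        return np.var(comps)

    rng = np.random.default_rng(0)
    for p in [101, 1009, 10007]:
        print(p, grad_component_variance(p, rng))   # decreases roughly like 1/p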
Long-Tail Theory under Gaussian Mixtures
We suggest a simple Gaussian mixture model for data generation that complies with Feldman's long tail theory (2020). We demonstrate that a linear classifier cannot decrease the generalization error below a certain level in the proposed model, whereas a nonlinear classifier with a memorization capacity can. This confirms that for long-tailed distributions, rare training examples must be considered for optimal generalization to new data. Finally, we show that the performance gap between linear and nonlinear models narrows as the tail of the subpopulation frequency distribution becomes shorter, as confirmed by experiments on synthetic and real data. Comment: accepted to ECAI 2023
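A minimal end-to-end sketch in the spirit of this setup follows; the Zipf-like subpopulation frequencies, the circular placement of subpopulation means, and the choice of logistic regression versus 1-nearest-neighbour as the linear and memorizing nonlinear classifiers are all assumptions for illustration, not the paper's construction.

    # Hedged sketch (not the paper's construction): a two-class Gaussian
    # mixture whose subpopulations have long-tailed (Zipf-like) frequencies.
    # Class labels alternate around a circle of subpopulation means, so no
    # linear separator fits all subpopulations, while a memorizing 1-NN
    # classifier also gets the rare ones right.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    n_sub, dim, n_train, n_test = 20, 2, 4000, 4000

    freqs = 1.0 / np.arange(1, n_sub + 1)      # long-tailed subpopulation frequencies
    freqs /= freqs.sum()

    angles = 2 * np.pi * np.arange(n_sub) / n_sub
    means = 5.0 * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    labels_of_sub = np.arange(n_sub) % 2       # label alternates with subpopulation

    def sample(n):
        subs = rng.choice(n_sub, size=n, p=freqs)
        X = means[subs] + 0.3 * rng.normal(size=(n, dim))
        return X, labels_of_sub[subs]

    X_tr, y_tr = sample(n_train)
    X_te, y_te = sample(n_test)

    linear = LogisticRegression().fit(X_tr, y_tr)
    memorizer = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
    print("linear test accuracy:   ", linear.score(X_te, y_te))
    print("nonlinear test accuracy:", memorizer.score(X_te, y_te))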